1 Introduction

This report analyzes data from the Protein Data Bank (PDB). The data describe ligands. Among other things, the dataset contains the name of each chemical molecule, the numbers of atoms and electrons, and further columns derived from a three-dimensional fragment of the structure's electron density. Columns built from dictionary values were omitted from the analysis. Because of limitations of the computing environment, the initial dataset was restricted to 400,000 rows.

2 Libraries used

library(EDAWR)
library(dplyr)
library(DT)
library(ggplot2)
library(plotly)
library(reshape2)
library(cowplot)
library(data.table)
library(qwraps2)
library(fastDummies)
library(caret)
library(kableExtra)
library(pROC)

3 Reproducibility of results.

set.seed(123)

4 Loading data from a file.

# Read a 100-row sample to infer the column classes, then reuse them for the full read.
initial <- fread("all_summary.csv", nrows = 100)
colClass <- sapply(initial, class)
pdb_table <- fread("all_summary.csv", nrows = 400000, colClasses = colClass)

5 Removing rows with selected res_name values.

6 Handling missing data.

# Replace NA values in each numeric column with that column's mean.
for (i in seq_len(ncol(pdb_clear_res_name_table))) {
  col <- pdb_clear_res_name_table[[i]]
  if (is.numeric(col)) pdb_clear_res_name_table[[i]][is.na(col)] <- mean(col, na.rm = TRUE)
}

Rows whose res_name is equal to one of the following values were removed from the dataset: UNK, UNX, UNL, DUM, N, BLOB, ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, MSE, PHE, PRO, SEC, SER, THR, TRP, TYR, VAL, DA, DG, DT, DC, DU, A, G, T, C, U, HOH, H20, WAT. The dataset was restricted to the columns described on the project page; columns not used for classification were dropped, except for res_name, local_res_atom_non_h_count, local_res_atom_non_h_electron_sum, dict_atom_non_h_count and dict_atom_non_h_electron_sum. NA values were replaced with the mean of the corresponding column.
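The two cleaning steps described above can be sketched on a small example. This is only an illustration in base R under stated assumptions: the data frame `toy`, the shortened list `excluded`, and the result name `cleaned` are hypothetical and do not come from the report, which works on the full table and the full list of res_name values.

```r
# Hypothetical toy data; the real table has several hundred columns.
toy <- data.frame(
  res_name     = c("SO4", "HOH", "GOL", "ALA", "ZN"),
  local_volume = c(10, 20, NA, 40, 30),
  stringsAsFactors = FALSE
)

# Shortened exclusion list, for illustration only.
excluded <- c("UNK", "UNX", "HOH", "ALA")

# Keep only the rows whose res_name is NOT in the exclusion list.
cleaned <- toy[!(toy$res_name %in% excluded), ]

# Replace NA in every numeric column with that column's mean.
numeric_cols <- names(cleaned)[sapply(cleaned, is.numeric)]
for (col in numeric_cols) {
  missing <- is.na(cleaned[[col]])
  cleaned[[col]][missing] <- mean(cleaned[[col]], na.rm = TRUE)
}

print(cleaned)
```

Note that `%in%` tests set membership, which is what removing rows by a list of names requires. The mean is computed after filtering, so removed rows do not influence the imputed value.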

7 Dataset summary.

Before cleaning, the dataset had 400000 rows and 412 columns. After cleaning, it has 393259 rows and 336 columns.

res_name
Length:393259
Class :character
Mode :character
variable mean sd median min max class
local_res_atom_non_h_count 1.353000e+01 1.514000e+01 6.000000e+00 1.00 1.060000e+02 numeric
local_res_atom_non_h_electron_sum 1.003800e+02 1.019700e+02 4.800000e+01 3.00 1.848000e+03 numeric
dict_atom_non_h_count 1.387000e+01 1.561000e+01 6.000000e+00 1.00 1.260000e+02 numeric
dict_atom_non_h_electron_sum 1.029000e+02 1.045600e+02 5.200000e+01 3.00 1.223000e+03 numeric
local_volume 8.564600e+02 1.467780e+03 3.438200e+02 49.25 9.095251e+04 numeric
local_electrons 1.769000e+01 2.526000e+01 7.770000e+00 0.01 4.424400e+02 numeric
local_mean 2.000000e-02 2.000000e-02 2.000000e-02 0.00 3.700000e-01 numeric
local_std 1.200000e-01 1.000000e-01 1.000000e-01 0.00 1.960000e+00 numeric
local_min 0.000000e+00 0.000000e+00 0.000000e+00 0.00 0.000000e+00 numeric
local_max 1.350000e+00 1.590000e+00 8.900000e-01 0.03 4.463000e+01 numeric
local_skewness 2.200000e-01 1.900000e-01 1.700000e-01 0.01 4.040000e+00 numeric
part_00_shape_segments_count 3.388500e+02 1.110370e+03 2.500000e+01 0.00 1.145770e+05 numeric
part_00_density_segments_count 3.388500e+02 1.110370e+03 2.500000e+01 0.00 1.145770e+05 numeric
part_00_volume 3.287000e+01 5.051000e+01 1.422000e+01 0.00 2.427940e+03 numeric
part_00_electrons 1.749000e+01 2.511000e+01 7.710000e+00 0.00 4.411400e+02 numeric
part_00_mean 6.000000e-01 4.000000e-01 5.100000e-01 0.00 8.600000e+00 numeric
part_00_std 2.100000e-01 3.000000e-01 1.200000e-01 0.00 8.010000e+00 numeric
part_00_max 1.350000e+00 1.590000e+00 8.900000e-01 0.00 4.463000e+01 numeric
part_00_max_over_std 9.740000e+00 7.600000e+00 7.220000e+00 0.00 1.732500e+02 numeric
part_00_skewness 2.100000e-01 3.400000e-01 1.100000e-01 0.00 1.051000e+01 numeric
part_00_parts 1.070000e+00 3.800000e-01 1.000000e+00 0.00 2.800000e+01 numeric
part_00_shape_O3 1.695558e+06 9.850960e+06 9.601052e+04 121.39 2.293302e+09 numeric
part_00_shape_O4 1.116615e+13 9.992375e+14 2.274156e+09 3807.05 4.027551e+17 numeric
part_00_shape_O5 1.916881e+20 5.147942e+22 1.481914e+13 33777.03 2.908412e+25 numeric
part_00_shape_FL 6.023041e+16 1.439510e+19 3.770154e+10 75.70 5.852600e+21 numeric
part_00_shape_O3_norm 4.900000e-01 3.300000e-01 3.800000e-01 0.23 3.965000e+01 numeric
part_00_shape_O4_norm 6.000000e-02 8.000000e-02 3.000000e-02 0.02 6.010000e+00 numeric
part_00_shape_O5_norm 0.000000e+00 0.000000e+00 0.000000e+00 0.00 4.100000e-01 numeric
part_00_shape_FL_norm 6.000000e-02 5.900000e-01 1.000000e-02 0.00 1.893500e+02 numeric
part_00_shape_I1 3.252631e+09 3.464861e+11 7.712680e+06 528.82 1.633222e+14 numeric
part_00_shape_I2 2.919856e+20 7.877431e+22 8.701642e+12 56939.71 3.482558e+25 numeric
part_00_shape_I3 1.178734e+23 4.974270e+25 2.430589e+13 121692.16 2.665214e+28 numeric
part_00_shape_I4 3.480335e+16 7.655498e+18 2.063800e+10 42.47 2.867703e+21 numeric
part_00_shape_I5 1.785198e+16 3.389322e+18 6.199766e+09 6.19 1.540703e+21 numeric
part_00_shape_I6 2.049683e+18 7.101063e+20 3.367402e+11 28626.18 3.743294e+23 numeric
part_00_shape_I1_norm 5.700000e-01 6.750000e+00 2.300000e-01 0.06 2.760570e+03 numeric
part_00_shape_I2_norm 9.000000e-02 1.100000e+00 1.000000e-02 0.00 3.039100e+02 numeric
part_00_shape_I3_norm 4.553000e+01 1.461936e+04 3.000000e-02 0.00 7.617375e+06 numeric
part_00_shape_I4_norm 4.000000e-02 5.800000e-01 0.000000e+00 0.00 1.890700e+02 numeric
part_00_shape_I5_norm 3.000000e-02 5.800000e-01 0.000000e+00 0.00 1.888900e+02 numeric
part_00_shape_I6_norm 1.120000e+00 2.293800e+02 4.000000e-02 0.00 1.094147e+05 numeric
part_00_shape_M000 4.109220e+03 6.313310e+03 1.778000e+03 38.00 3.034930e+05 numeric
part_00_shape_CI 4.000000e-02 4.120000e+00 0.000000e+00 -129.45 7.004000e+01 numeric
part_00_shape_E3_E1 2.400000e-01 2.000000e-01 1.700000e-01 0.00 9.900000e-01 numeric
part_00_shape_E2_E1 4.200000e-01 2.400000e-01 3.800000e-01 0.00 1.000000e+00 numeric
part_00_shape_E3_E2 5.500000e-01 2.300000e-01 5.700000e-01 0.01 1.000000e+00 numeric
part_00_shape_sqrt_E1 8.030000e+00 5.950000e+00 5.870000e+00 1.24 2.027600e+02 numeric
part_00_shape_sqrt_E2 4.420000e+00 2.730000e+00 3.510000e+00 0.74 3.452000e+01 numeric
part_00_shape_sqrt_E3 2.940000e+00 1.420000e+00 2.580000e+00 0.60 1.993000e+01 numeric
part_00_density_O3 7.961914e+05 2.894272e+06 4.686123e+04 9.71 4.315815e+08 numeric
part_00_density_O4 1.360434e+12 3.255212e+13 5.451315e+08 24.38 1.207528e+16 numeric
part_00_density_O5 1.589070e+18 1.517293e+20 1.722258e+12 17.32 5.920612e+22 numeric
part_00_density_FL 3.306074e+15 6.223221e+17 8.229923e+09 -3.01 3.058302e+20 numeric
part_00_density_O3_norm 7.500000e-01 1.070000e+00 6.100000e-01 0.04 4.123300e+02 numeric
part_00_density_O4_norm 1.500000e-01 2.200000e-01 9.000000e-02 0.00 3.293000e+01 numeric
part_00_density_O5_norm 1.000000e-02 2.000000e-02 0.000000e+00 0.00 4.430000e+00 numeric
part_00_density_FL_norm 3.800000e-01 4.728000e+01 2.000000e-02 -0.03 2.927105e+04 numeric
part_00_density_I1 1.058597e+09 3.301963e+10 3.414976e+06 42.22 1.282083e+13 numeric
part_00_density_I2 1.219865e+19 2.907814e+21 1.725746e+12 363.10 1.380658e+24 numeric
part_00_density_I3 1.003384e+21 3.220425e+23 4.769437e+12 775.23 1.642551e+26 numeric
part_00_density_I4 1.996255e+15 3.413131e+17 4.943933e+09 -1.01 1.713566e+20 numeric
part_00_density_I5 1.123042e+15 1.563157e+17 1.945365e+09 0.18 8.170753e+19 numeric
part_00_density_I6 3.604626e+16 7.467917e+18 7.076168e+10 182.74 2.905409e+21 numeric
part_00_density_I1_norm 2.940000e+00 5.121100e+02 5.800000e-01 0.00 2.985910e+05 numeric
part_00_density_I2_norm 7.900000e-01 2.053000e+01 5.000000e-02 0.00 7.616410e+03 numeric
part_00_density_I3_norm 2.621333e+05 1.425401e+08 1.500000e-01 0.00 8.911718e+10 numeric
part_00_density_I4_norm 3.100000e-01 4.721000e+01 1.000000e-02 -0.01 2.923185e+04 numeric
part_00_density_I5_norm 2.700000e-01 4.716000e+01 0.000000e+00 0.00 2.920571e+04 numeric
part_00_density_I6_norm 4.339200e+02 1.991169e+05 1.700000e-01 0.00 1.230801e+08 numeric
part_00_density_M000 2.186050e+03 3.139180e+03 9.634300e+02 3.05 5.514218e+04 numeric
part_00_density_CI 4.000000e-02 4.700000e+00 0.000000e+00 -155.70 8.996000e+01 numeric
part_00_density_E3_E1 2.500000e-01 2.000000e-01 1.700000e-01 0.00 1.000000e+00 numeric
part_00_density_E2_E1 4.200000e-01 2.500000e-01 3.800000e-01 0.00 1.000000e+00 numeric
part_00_density_E3_E2 5.600000e-01 2.300000e-01 5.800000e-01 0.01 1.000000e+00 numeric
part_00_density_sqrt_E1 7.720000e+00 5.840000e+00 5.550000e+00 1.24 2.024800e+02 numeric
part_00_density_sqrt_E2 4.200000e+00 2.630000e+00 3.270000e+00 0.74 3.280000e+01 numeric
part_00_density_sqrt_E3 2.780000e+00 1.340000e+00 2.430000e+00 0.60 1.938000e+01 numeric
part_00_shape_Z_7_3 4.093000e+01 3.645000e+01 2.658000e+01 6.30 5.587100e+02 numeric
part_00_shape_Z_0_0 2.615000e+01 1.723000e+01 2.060000e+01 3.01 2.691700e+02 numeric
part_00_shape_Z_7_0 1.735000e+01 1.671000e+01 1.015000e+01 0.85 3.669900e+02 numeric
part_00_shape_Z_7_1 2.813000e+01 2.593000e+01 1.757000e+01 3.66 4.461400e+02 numeric
part_00_shape_Z_3_0 1.505000e+01 1.296000e+01 1.066000e+01 0.50 2.081300e+02 numeric
part_00_shape_Z_5_2 3.493000e+01 2.908000e+01 2.470000e+01 4.58 4.551000e+02 numeric
part_00_shape_Z_6_1 3.166000e+01 2.848000e+01 2.085000e+01 1.81 4.762100e+02 numeric
part_00_shape_Z_3_1 2.437000e+01 1.902000e+01 1.807000e+01 2.51 2.972800e+02 numeric
part_00_shape_Z_6_0 1.479000e+01 1.402000e+01 9.890000e+00 0.02 2.990100e+02 numeric
part_00_shape_Z_2_1 3.825000e+01 2.770000e+01 2.870000e+01 2.75 4.208100e+02 numeric
part_00_shape_Z_6_3 4.648000e+01 4.064000e+01 3.112000e+01 4.11 6.084300e+02 numeric
part_00_shape_Z_2_0 2.809000e+01 1.984000e+01 2.179000e+01 1.32 3.265300e+02 numeric
part_00_shape_Z_6_2 4.197000e+01 3.728000e+01 2.791000e+01 2.94 5.622000e+02 numeric
part_00_shape_Z_5_0 1.827000e+01 1.712000e+01 1.218000e+01 0.88 3.155700e+02 numeric
part_00_shape_Z_5_1 2.887000e+01 2.459000e+01 2.043000e+01 3.46 4.074900e+02 numeric
part_00_shape_Z_4_2 4.374000e+01 3.569000e+01 3.133000e+01 3.58 5.344700e+02 numeric
part_00_shape_Z_1_0 1.430000e+00 2.100000e-01 1.410000e+00 0.74 2.400000e+00 numeric
part_00_shape_Z_4_1 3.763000e+01 3.120000e+01 2.698000e+01 1.95 4.656000e+02 numeric
part_00_shape_Z_7_2 3.638000e+01 3.305000e+01 2.330000e+01 5.96 5.303200e+02 numeric
part_00_shape_Z_4_0 2.046000e+01 1.753000e+01 1.485000e+01 0.03 3.130900e+02 numeric
part_00_density_Z_7_3 3.030000e+01 2.754000e+01 1.942000e+01 2.89 2.059200e+02 numeric
part_00_density_Z_0_0 1.904000e+01 1.262000e+01 1.517000e+01 0.85 1.147400e+02 numeric
part_00_density_Z_7_0 1.486000e+01 1.374000e+01 8.520000e+00 0.98 1.269200e+02 numeric
part_00_density_Z_7_1 2.230000e+01 2.042000e+01 1.391000e+01 2.88 1.599200e+02 numeric
part_00_density_Z_3_0 1.129000e+01 9.710000e+00 7.760000e+00 0.42 8.854000e+01 numeric
part_00_density_Z_5_2 2.552000e+01 2.161000e+01 1.772000e+01 2.15 1.821800e+02 numeric
part_00_density_Z_6_1 2.469000e+01 2.254000e+01 1.713000e+01 0.51 1.746700e+02 numeric
part_00_density_Z_3_1 1.729000e+01 1.388000e+01 1.244000e+01 1.44 1.208600e+02 numeric
part_00_density_Z_6_0 1.253000e+01 1.254000e+01 8.040000e+00 0.01 1.224900e+02 numeric
part_00_density_Z_2_1 2.801000e+01 2.003000e+01 2.147000e+01 0.91 1.762500e+02 numeric
part_00_density_Z_6_3 3.422000e+01 3.068000e+01 2.352000e+01 1.16 2.600800e+02 numeric
part_00_density_Z_2_0 2.156000e+01 1.486000e+01 1.696000e+01 0.51 1.352800e+02 numeric
part_00_density_Z_6_2 3.148000e+01 2.846000e+01 2.167000e+01 0.85 2.355200e+02 numeric
part_00_density_Z_5_0 1.489000e+01 1.344000e+01 9.730000e+00 0.87 1.182300e+02 numeric
part_00_density_Z_5_1 2.185000e+01 1.860000e+01 1.519000e+01 2.14 1.670800e+02 numeric
part_00_density_Z_4_2 3.225000e+01 2.614000e+01 2.349000e+01 1.01 2.284400e+02 numeric
part_00_density_Z_1_0 1.420000e+00 2.200000e-01 1.390000e+00 0.68 2.400000e+00 numeric
part_00_density_Z_4_1 2.851000e+01 2.309000e+01 2.093000e+01 0.76 1.810600e+02 numeric
part_00_density_Z_7_2 2.760000e+01 2.528000e+01 1.751000e+01 2.89 1.951200e+02 numeric
part_00_density_Z_4_0 1.700000e+01 1.404000e+01 1.273000e+01 0.01 1.196500e+02 numeric
part_01_shape_segments_count 2.830100e+02 9.662200e+02 1.500000e+01 0.00 6.920200e+04 numeric
part_01_density_segments_count 2.830100e+02 9.662200e+02 1.500000e+01 0.00 6.920200e+04 numeric
part_01_volume 2.525000e+01 4.106000e+01 1.031000e+01 0.00 1.996250e+03 numeric
part_01_electrons 1.504000e+01 2.287000e+01 6.120000e+00 0.00 3.957000e+02 numeric
part_01_mean 6.500000e-01 4.300000e-01 5.600000e-01 0.00 8.860000e+00 numeric
part_01_std 2.000000e-01 3.000000e-01 1.100000e-01 0.00 8.080000e+00 numeric
part_01_max 1.350000e+00 1.590000e+00 8.900000e-01 0.00 4.463000e+01 numeric
part_01_max_over_std 9.710000e+00 7.640000e+00 7.220000e+00 0.00 1.732500e+02 numeric
part_01_skewness 2.000000e-01 3.400000e-01 1.000000e-01 0.00 1.077000e+01 numeric
part_01_parts 1.270000e+00 7.000000e-01 1.000000e+00 0.00 2.400000e+01 numeric
part_01_shape_O3 1.276332e+06 7.453268e+06 5.937043e+04 74.84 1.837172e+09 numeric
part_01_shape_O4 5.888376e+12 4.679577e+14 8.720523e+08 1818.62 1.680769e+17 numeric
part_01_shape_O5 5.793748e+19 1.314826e+22 3.512181e+12 13532.12 5.890566e+24 numeric
part_01_shape_FL 3.084952e+16 7.231544e+18 1.046845e+10 0.00 2.872585e+21 numeric
part_01_shape_O3_norm 5.300000e-01 4.300000e-01 3.700000e-01 0.23 4.375000e+01 numeric
part_01_shape_O4_norm 7.000000e-02 1.100000e-01 3.000000e-02 0.02 1.175000e+01 numeric
part_01_shape_O5_norm 0.000000e+00 1.000000e-02 0.000000e+00 0.00 1.230000e+00 numeric
part_01_shape_FL_norm 1.400000e-01 2.230000e+00 0.000000e+00 0.00 5.526100e+02 numeric
part_01_shape_I1 2.410654e+09 2.754567e+11 3.905652e+06 210.50 1.307399e+14 numeric
part_01_shape_I2 1.474496e+20 4.067568e+22 2.238828e+12 10919.68 1.923967e+25 numeric
part_01_shape_I3 7.544764e+22 3.200528e+25 6.116160e+12 9454.58 1.708172e+28 numeric
part_01_shape_I4 1.783789e+16 3.751691e+18 5.856015e+09 0.00 1.407712e+21 numeric
part_01_shape_I5 9.163466e+15 1.520986e+18 1.619440e+09 0.00 4.931548e+20 numeric
part_01_shape_I6 1.273856e+18 4.557894e+20 1.045825e+11 5359.46 2.400810e+23 numeric
part_01_shape_I1_norm 7.700000e-01 8.740000e+00 2.200000e-01 0.06 3.426140e+03 numeric
part_01_shape_I2_norm 2.400000e-01 1.197000e+01 1.000000e-02 0.00 6.729610e+03 numeric
part_01_shape_I3_norm 7.693000e+01 2.330541e+04 2.000000e-02 0.00 1.173424e+07 numeric
part_01_shape_I4_norm 1.100000e-01 2.270000e+00 0.000000e+00 0.00 6.056100e+02 numeric
part_01_shape_I5_norm 1.000000e-01 2.310000e+00 0.000000e+00 0.00 6.409500e+02 numeric
part_01_shape_I6_norm 1.790000e+00 3.274700e+02 4.000000e-02 0.00 1.498618e+05 numeric
part_01_shape_M000 3.186450e+03 5.123700e+03 1.328000e+03 32.00 2.495310e+05 numeric
part_01_shape_CI 3.000000e-02 4.530000e+00 0.000000e+00 -142.64 7.127000e+01 numeric
part_01_shape_E3_E1 2.500000e-01 2.100000e-01 1.800000e-01 0.00 9.900000e-01 numeric
part_01_shape_E2_E1 4.300000e-01 2.500000e-01 4.000000e-01 0.00 1.000000e+00 numeric
part_01_shape_E3_E2 5.700000e-01 2.300000e-01 5.900000e-01 0.01 1.000000e+00 numeric
part_01_shape_sqrt_E1 7.450000e+00 5.990000e+00 5.270000e+00 0.93 2.023700e+02 numeric
part_01_shape_sqrt_E2 4.030000e+00 2.730000e+00 3.160000e+00 0.53 3.206000e+01 numeric
part_01_shape_sqrt_E3 2.660000e+00 1.410000e+00 2.350000e+00 0.37 1.906000e+01 numeric
part_01_density_O3 6.719290e+05 2.504688e+06 3.307433e+04 5.07 3.763261e+08 numeric
part_01_density_O4 9.962197e+11 2.296292e+13 2.639872e+08 6.52 8.774435e+15 numeric
part_01_density_O5 9.393053e+17 8.694761e+19 5.760985e+11 1.74 3.619558e+22 numeric
part_01_density_FL 2.372681e+15 4.490618e+17 2.842058e+09 -11.69 2.206107e+20 numeric
part_01_density_O3_norm 7.700000e-01 1.090000e+00 5.700000e-01 0.04 3.097900e+02 numeric
part_01_density_O4_norm 1.600000e-01 3.000000e-01 8.000000e-02 0.00 3.095000e+01 numeric
part_01_density_O5_norm 1.000000e-02 3.000000e-02 0.000000e+00 0.00 3.740000e+00 numeric
part_01_density_FL_norm 7.900000e-01 2.885000e+01 1.000000e-02 -0.04 1.474174e+04 numeric
part_01_density_I1 8.851343e+08 2.810244e+10 2.011426e+06 23.86 1.098793e+13 numeric
part_01_density_I2 8.447497e+18 2.080879e+21 5.998141e+11 81.51 9.956763e+23 numeric
part_01_density_I3 7.378980e+20 2.363363e+23 1.584537e+12 186.52 1.206633e+26 numeric
part_01_density_I4 1.469681e+15 2.535361e+17 1.703236e+09 -3.02 1.278331e+20 numeric
part_01_density_I5 8.676808e+14 1.250201e+17 6.184899e+08 0.00 6.598137e+19 numeric
part_01_density_I6 2.664503e+16 5.527029e+18 2.976817e+10 63.49 2.140061e+21 numeric
part_01_density_I1_norm 3.090000e+00 3.784700e+02 5.100000e-01 0.00 1.711294e+05 numeric
part_01_density_I2_norm 2.250000e+00 1.563800e+02 4.000000e-02 0.00 7.025807e+04 numeric
part_01_density_I3_norm 1.445226e+05 5.385298e+07 1.100000e-01 0.00 2.926873e+10 numeric
part_01_density_I4_norm 7.200000e-01 2.927000e+01 1.000000e-02 -0.02 1.478973e+04 numeric
part_01_density_I5_norm 6.700000e-01 2.965000e+01 0.000000e+00 0.00 1.482173e+04 numeric
part_01_density_I6_norm 3.171300e+02 1.042647e+05 1.300000e-01 0.00 5.299373e+07 numeric
part_01_density_M000 1.897890e+03 2.852510e+03 7.915200e+02 1.44 4.946192e+04 numeric
part_01_density_CI 4.000000e-02 5.090000e+00 0.000000e+00 -162.58 1.049000e+02 numeric
part_01_density_E3_E1 2.600000e-01 2.100000e-01 1.800000e-01 0.00 1.000000e+00 numeric
part_01_density_E2_E1 4.300000e-01 2.500000e-01 4.000000e-01 0.00 1.000000e+00 numeric
part_01_density_E3_E2 5.700000e-01 2.400000e-01 5.900000e-01 0.01 1.000000e+00 numeric
part_01_density_sqrt_E1 7.190000e+00 5.870000e+00 4.960000e+00 0.93 2.021700e+02 numeric
part_01_density_sqrt_E2 3.840000e+00 2.640000e+00 2.970000e+00 0.53 3.077000e+01 numeric
part_01_density_sqrt_E3 2.530000e+00 1.340000e+00 2.230000e+00 0.37 1.869000e+01 numeric
part_01_shape_Z_7_3 3.578000e+01 3.373000e+01 2.195000e+01 4.61 4.702800e+02 numeric
part_01_shape_Z_0_0 2.243000e+01 1.598000e+01 1.781000e+01 2.76 2.440700e+02 numeric
part_01_shape_Z_7_0 1.597000e+01 1.540000e+01 8.700000e+00 0.71 2.915600e+02 numeric
part_01_shape_Z_7_1 2.494000e+01 2.397000e+01 1.439000e+01 3.42 3.709900e+02 numeric
part_01_shape_Z_3_0 1.341000e+01 1.216000e+01 9.020000e+00 0.63 1.917600e+02 numeric
part_01_shape_Z_5_2 3.023000e+01 2.696000e+01 2.051000e+01 3.10 4.033000e+02 numeric
part_01_shape_Z_6_1 2.730000e+01 2.665000e+01 1.695000e+01 0.79 3.717000e+02 numeric
part_01_shape_Z_3_1 2.127000e+01 1.774000e+01 1.547000e+01 2.42 2.606700e+02 numeric
part_01_shape_Z_6_0 1.306000e+01 1.346000e+01 8.210000e+00 0.01 2.636200e+02 numeric
part_01_shape_Z_2_1 3.264000e+01 2.569000e+01 2.422000e+01 1.71 3.463800e+02 numeric
part_01_shape_Z_6_3 4.009000e+01 3.784000e+01 2.567000e+01 3.40 5.043900e+02 numeric
part_01_shape_Z_2_0 2.378000e+01 1.850000e+01 1.816000e+01 0.05 2.846000e+02 numeric
part_01_shape_Z_6_2 3.602000e+01 3.467000e+01 2.267000e+01 2.46 4.666900e+02 numeric
part_01_shape_Z_5_0 1.638000e+01 1.587000e+01 9.780000e+00 0.77 2.952300e+02 numeric
part_01_shape_Z_5_1 2.492000e+01 2.270000e+01 1.669000e+01 2.41 3.681300e+02 numeric
part_01_shape_Z_4_2 3.730000e+01 3.316000e+01 2.569000e+01 2.22 4.224900e+02 numeric
part_01_shape_Z_1_0 1.540000e+00 3.100000e-01 1.500000e+00 0.70 4.280000e+00 numeric
part_01_shape_Z_4_1 3.183000e+01 2.898000e+01 2.177000e+01 1.14 3.746300e+02 numeric
part_01_shape_Z_7_2 3.171000e+01 3.046000e+01 1.890000e+01 4.02 4.472700e+02 numeric
part_01_shape_Z_4_0 1.736000e+01 1.655000e+01 1.171000e+01 0.01 2.751700e+02 numeric
part_01_density_Z_7_3 2.800000e+01 2.657000e+01 1.657000e+01 2.60 2.030500e+02 numeric
part_01_density_Z_0_0 1.727000e+01 1.238000e+01 1.375000e+01 0.59 1.086700e+02 numeric
part_01_density_Z_7_0 1.425000e+01 1.315000e+01 7.810000e+00 1.06 1.251700e+02 numeric
part_01_density_Z_7_1 2.077000e+01 1.963000e+01 1.162000e+01 1.91 1.556800e+02 numeric
part_01_density_Z_3_0 1.067000e+01 9.470000e+00 6.910000e+00 0.44 8.632000e+01 numeric
part_01_density_Z_5_2 2.343000e+01 2.094000e+01 1.539000e+01 2.07 1.804900e+02 numeric
part_01_density_Z_6_1 2.208000e+01 2.202000e+01 1.397000e+01 0.37 1.685900e+02 numeric
part_01_density_Z_3_1 1.606000e+01 1.353000e+01 1.119000e+01 1.00 1.174500e+02 numeric
part_01_density_Z_6_0 1.133000e+01 1.240000e+01 6.270000e+00 0.01 1.174500e+02 numeric
part_01_density_Z_2_1 2.527000e+01 1.959000e+01 1.919000e+01 0.71 1.709200e+02 numeric
part_01_density_Z_6_3 3.089000e+01 2.989000e+01 1.986000e+01 1.05 2.519900e+02 numeric
part_01_density_Z_2_0 1.929000e+01 1.465000e+01 1.493000e+01 0.06 1.281300e+02 numeric
part_01_density_Z_6_2 2.821000e+01 2.768000e+01 1.804000e+01 0.88 2.259600e+02 numeric
part_01_density_Z_5_0 1.400000e+01 1.296000e+01 8.280000e+00 0.79 1.139400e+02 numeric
part_01_density_Z_5_1 1.998000e+01 1.795000e+01 1.294000e+01 1.93 1.648700e+02 numeric
part_01_density_Z_4_2 2.887000e+01 2.558000e+01 2.038000e+01 0.84 2.217200e+02 numeric
part_01_density_Z_1_0 1.530000e+00 3.100000e-01 1.490000e+00 0.62 4.290000e+00 numeric
part_01_density_Z_4_1 2.527000e+01 2.262000e+01 1.793000e+01 0.47 1.753100e+02 numeric
part_01_density_Z_7_2 2.541000e+01 2.429000e+01 1.469000e+01 2.26 1.917000e+02 numeric
part_01_density_Z_4_0 1.495000e+01 1.397000e+01 1.057000e+01 0.01 1.188000e+02 numeric
part_02_shape_segments_count 2.365800e+02 8.451900e+02 9.000000e+00 0.00 4.556400e+04 numeric
part_02_density_segments_count 2.365800e+02 8.451900e+02 9.000000e+00 0.00 4.556400e+04 numeric
part_02_volume 1.947000e+01 3.372000e+01 7.350000e+00 0.00 1.632540e+03 numeric
part_02_electrons 1.284000e+01 2.064000e+01 4.730000e+00 0.00 3.511900e+02 numeric
part_02_mean 6.800000e-01 4.800000e-01 6.000000e-01 0.00 9.760000e+00 numeric
part_02_std 1.900000e-01 3.100000e-01 9.000000e-02 0.00 8.260000e+00 numeric
part_02_max 1.320000e+00 1.600000e+00 8.800000e-01 0.00 4.463000e+01 numeric
part_02_max_over_std 9.480000e+00 7.860000e+00 7.220000e+00 0.00 1.732500e+02 numeric
part_02_skewness 1.900000e-01 3.400000e-01 8.000000e-02 0.00 1.089000e+01 numeric
part_02_parts 1.300000e+00 9.500000e-01 1.000000e+00 0.00 2.600000e+01 numeric
part_02_shape_O3 1.014612e+06 5.732035e+06 4.992546e+04 72.00 1.487394e+09 numeric
part_02_shape_O4 3.440276e+12 2.387292e+14 6.119596e+08 1728.00 8.781097e+16 numeric
part_02_shape_O5 2.085633e+19 4.284804e+21 2.073642e+12 12288.00 2.033831e+24 numeric
part_02_shape_FL 1.662951e+16 3.744737e+18 6.558381e+09 -61.16 1.394306e+21 numeric
part_02_shape_O3_norm 5.800000e-01 5.400000e-01 3.700000e-01 0.22 6.888000e+01 numeric
part_02_shape_O4_norm 9.000000e-02 1.700000e-01 3.000000e-02 0.02 2.169000e+01 numeric
part_02_shape_O5_norm 0.000000e+00 2.000000e-02 0.000000e+00 0.00 6.420000e+00 numeric
part_02_shape_FL_norm 3.600000e-01 9.310000e+00 0.000000e+00 0.00 3.374840e+03 numeric
part_02_shape_I1 1.885939e+09 2.150612e+11 3.103373e+06 186.00 1.057686e+14 numeric
part_02_shape_I2 8.098471e+19 2.193735e+22 1.384962e+12 9312.00 1.018994e+25 numeric
part_02_shape_I3 4.906230e+22 2.008981e+25 3.838129e+12 7092.00 1.118111e+28 numeric
part_02_shape_I4 9.749810e+15 1.969465e+18 3.736084e+09 -21.56 7.546862e+20 numeric
part_02_shape_I5 5.163344e+15 8.306804e+17 1.020489e+09 0.00 3.282730e+20 numeric
part_02_shape_I6 8.204262e+17 2.860271e+20 6.931835e+10 4464.00 1.572613e+23 numeric
part_02_shape_I1_norm 1.060000e+00 1.552000e+01 2.300000e-01 0.06 8.356550e+03 numeric
part_02_shape_I2_norm 9.200000e-01 1.095600e+02 1.000000e-02 0.00 4.807317e+04 numeric
part_02_shape_I3_norm 2.581300e+02 1.139011e+05 3.000000e-02 0.00 6.981922e+07 numeric
part_02_shape_I4_norm 3.300000e-01 9.590000e+00 0.000000e+00 0.00 3.367000e+03 numeric
part_02_shape_I5_norm 3.100000e-01 9.840000e+00 0.000000e+00 0.00 3.361780e+03 numeric
part_02_shape_I6_norm 3.780000e+00 9.674900e+02 4.000000e-02 0.00 5.755581e+05 numeric
part_02_shape_M000 2.613920e+03 4.162100e+03 1.185000e+03 32.00 2.040670e+05 numeric
part_02_shape_CI 3.000000e-02 4.710000e+00 1.000000e-02 -153.75 1.470400e+02 numeric
part_02_shape_E3_E1 2.700000e-01 2.100000e-01 2.300000e-01 0.00 1.000000e+00 numeric
part_02_shape_E2_E1 4.400000e-01 2.500000e-01 4.400000e-01 0.00 1.000000e+00 numeric
part_02_shape_E3_E2 5.800000e-01 2.300000e-01 5.900000e-01 0.01 1.000000e+00 numeric
part_02_shape_sqrt_E1 7.110000e+00 5.900000e+00 5.110000e+00 0.87 2.018100e+02 numeric
part_02_shape_sqrt_E2 3.790000e+00 2.650000e+00 3.060000e+00 0.57 2.965000e+01 numeric
part_02_shape_sqrt_E3 2.500000e+00 1.350000e+00 2.300000e+00 0.42 1.822000e+01 numeric
part_02_density_O3 5.898117e+05 2.142951e+06 3.142096e+04 9.54 3.232489e+08 numeric
part_02_density_O4 7.561160e+11 1.618552e+13 2.389454e+08 26.47 6.115114e+15 numeric
part_02_density_O5 5.906779e+17 5.133060e+19 4.901528e+11 19.58 2.094861e+22 numeric
part_02_density_FL 1.765939e+15 3.193372e+17 2.240732e+09 -23.33 1.536992e+20 numeric
part_02_density_O3_norm 7.800000e-01 1.120000e+00 5.500000e-01 0.03 3.829800e+02 numeric
part_02_density_O4_norm 1.600000e-01 4.400000e-01 7.000000e-02 0.00 1.155500e+02 numeric
part_02_density_O5_norm 1.000000e-02 1.900000e-01 0.000000e+00 0.00 1.130500e+02 numeric
part_02_density_FL_norm 2.510000e+00 3.340100e+02 1.000000e-02 0.00 1.979128e+05 numeric
part_02_density_I1 7.687959e+08 2.338538e+10 1.806941e+06 26.90 9.365256e+12 numeric
part_02_density_I2 6.088224e+18 1.457944e+21 4.777123e+11 184.28 6.907748e+23 numeric
part_02_density_I3 5.462508e+20 1.665822e+23 1.252898e+12 162.25 8.766542e+25 numeric
part_02_density_I4 1.131192e+15 1.880488e+17 1.353432e+09 -5.77 9.375584e+19 numeric
part_02_density_I5 7.080263e+14 1.020142e+17 4.826866e+08 0.00 5.379362e+19 numeric
part_02_density_I6 2.018508e+16 3.972588e+18 2.538124e+10 88.32 1.523784e+21 numeric
part_02_density_I1_norm 3.410000e+00 4.507300e+02 4.700000e-01 0.00 2.583039e+05 numeric
part_02_density_I2_norm 1.456000e+01 3.643240e+03 3.000000e-02 0.00 2.154056e+06 numeric
part_02_density_I3_norm 2.182020e+05 1.076679e+08 1.000000e-01 0.00 6.670879e+10 numeric
part_02_density_I4_norm 2.640000e+00 4.366500e+02 1.000000e-02 0.00 2.648333e+05 numeric
part_02_density_I5_norm 2.720000e+00 5.061100e+02 0.000000e+00 0.00 3.094470e+05 numeric
part_02_density_I6_norm 3.696800e+02 1.629474e+05 1.200000e-01 0.00 9.891203e+07 numeric
part_02_density_M000 1.724400e+03 2.543160e+03 7.908900e+02 2.54 4.389833e+04 numeric
part_02_density_CI 3.000000e-02 5.250000e+00 1.000000e-02 -166.26 1.675400e+02 numeric
part_02_density_E3_E1 2.700000e-01 2.200000e-01 2.300000e-01 0.00 1.000000e+00 numeric
part_02_density_E2_E1 4.400000e-01 2.500000e-01 4.400000e-01 0.00 1.000000e+00 numeric
part_02_density_E3_E2 5.800000e-01 2.300000e-01 5.900000e-01 0.01 1.000000e+00 numeric
part_02_density_sqrt_E1 6.870000e+00 5.780000e+00 4.810000e+00 0.87 2.017100e+02 numeric
part_02_density_sqrt_E2 3.630000e+00 2.560000e+00 2.870000e+00 0.57 2.875000e+01 numeric
part_02_density_sqrt_E3 2.390000e+00 1.290000e+00 2.190000e+00 0.42 1.768000e+01 numeric
part_02_shape_Z_7_3 3.261000e+01 3.051000e+01 2.041000e+01 5.84 4.144200e+02 numeric
part_02_shape_Z_0_0 2.004000e+01 1.438000e+01 1.682000e+01 2.76 2.207200e+02 numeric
part_02_shape_Z_7_0 1.521000e+01 1.395000e+01 8.970000e+00 0.91 2.288400e+02 numeric
part_02_shape_Z_7_1 2.306000e+01 2.168000e+01 1.346000e+01 3.76 3.197400e+02 numeric
part_02_shape_Z_3_0 1.239000e+01 1.114000e+01 8.500000e+00 0.66 1.951600e+02 numeric
part_02_shape_Z_5_2 2.730000e+01 2.438000e+01 1.902000e+01 3.98 3.446100e+02 numeric
part_02_shape_Z_6_1 2.438000e+01 2.434000e+01 1.569000e+01 0.97 3.223400e+02 numeric
part_02_shape_Z_3_1 1.930000e+01 1.613000e+01 1.458000e+01 2.62 2.619600e+02 numeric
part_02_shape_Z_6_0 1.186000e+01 1.257000e+01 7.690000e+00 0.00 1.878500e+02 numeric
part_02_shape_Z_2_1 2.896000e+01 2.320000e+01 2.248000e+01 1.59 3.159100e+02 numeric
part_02_shape_Z_6_3 3.586000e+01 3.442000e+01 2.391000e+01 3.18 4.571400e+02 numeric
part_02_shape_Z_2_0 2.096000e+01 1.676000e+01 1.673000e+01 0.06 2.567500e+02 numeric
part_02_shape_Z_6_2 3.207000e+01 3.148000e+01 2.094000e+01 2.17 4.218400e+02 numeric
part_02_shape_Z_5_0 1.525000e+01 1.442000e+01 9.070000e+00 0.88 2.418300e+02 numeric
part_02_shape_Z_5_1 2.251000e+01 2.045000e+01 1.537000e+01 2.71 3.003900e+02 numeric
part_02_shape_Z_4_2 3.304000e+01 3.008000e+01 2.360000e+01 2.23 3.798200e+02 numeric
part_02_shape_Z_1_0 1.650000e+00 4.000000e-01 1.630000e+00 0.67 4.980000e+00 numeric
part_02_shape_Z_4_1 2.800000e+01 2.626000e+01 1.975000e+01 0.88 3.450400e+02 numeric
part_02_shape_Z_7_2 2.887000e+01 2.744000e+01 1.746000e+01 4.53 3.775500e+02 numeric
part_02_shape_Z_4_0 1.535000e+01 1.518000e+01 1.055000e+01 0.01 2.135400e+02 numeric
part_02_density_Z_7_3 2.684000e+01 2.500000e+01 1.618000e+01 3.21 2.001800e+02 numeric
part_02_density_Z_0_0 1.628000e+01 1.168000e+01 1.374000e+01 0.78 1.023700e+02 numeric
part_02_density_Z_7_0 1.410000e+01 1.240000e+01 8.520000e+00 1.21 1.229800e+02 numeric
part_02_density_Z_7_1 2.008000e+01 1.846000e+01 1.143000e+01 1.99 1.516200e+02 numeric
part_02_density_Z_3_0 1.039000e+01 9.020000e+00 6.870000e+00 0.53 8.338000e+01 numeric
part_02_density_Z_5_2 2.231000e+01 1.975000e+01 1.507000e+01 2.33 1.784800e+02 numeric
part_02_density_Z_6_1 2.051000e+01 2.094000e+01 1.312000e+01 0.46 1.621400e+02 numeric
part_02_density_Z_3_1 1.541000e+01 1.282000e+01 1.119000e+01 1.96 1.138300e+02 numeric
part_02_density_Z_6_0 1.065000e+01 1.200000e+01 5.880000e+00 0.01 1.151500e+02 numeric
part_02_density_Z_2_1 2.364000e+01 1.850000e+01 1.891000e+01 0.78 1.652800e+02 numeric
part_02_density_Z_6_3 2.891000e+01 2.831000e+01 1.914000e+01 0.98 2.429600e+02 numeric
part_02_density_Z_2_0 1.793000e+01 1.391000e+01 1.461000e+01 0.03 1.205600e+02 numeric
part_02_density_Z_6_2 2.624000e+01 2.618000e+01 1.718000e+01 0.77 2.156300e+02 numeric
part_02_density_Z_5_0 1.361000e+01 1.226000e+01 8.200000e+00 0.64 1.106500e+02 numeric
part_02_density_Z_5_1 1.900000e+01 1.687000e+01 1.257000e+01 1.68 1.624200e+02 numeric
part_02_density_Z_4_2 2.679000e+01 2.426000e+01 1.968000e+01 0.78 2.142100e+02 numeric
part_02_density_Z_1_0 1.650000e+00 4.000000e-01 1.620000e+00 0.61 4.960000e+00 numeric
part_02_density_Z_4_1 2.324000e+01 2.147000e+01 1.712000e+01 0.40 1.689100e+02 numeric
part_02_density_Z_7_2 2.429000e+01 2.277000e+01 1.425000e+01 2.61 1.884000e+02 numeric
part_02_density_Z_4_0 1.371000e+01 1.346000e+01 9.920000e+00 0.01 1.186200e+02 numeric
resolution 2.150000e+00 5.400000e-01 2.070000e+00 0.48 8.200000e+00 numeric
FoFc_mean 0.000000e+00 0.000000e+00 0.000000e+00 0.00 0.000000e+00 numeric
FoFc_std 1.300000e-01 5.000000e-02 1.200000e-01 0.01 9.400000e-01 numeric
FoFc_square_std 2.000000e-02 2.000000e-02 1.000000e-02 0.00 8.900000e-01 numeric
FoFc_min -7.000000e-01 3.000000e-01 -6.600000e-01 -7.55 -4.000000e-02 numeric
FoFc_max 2.600000e+00 2.540000e+00 1.840000e+00 0.04 4.526000e+01 numeric

8 Restricting the dataset to the 50 most frequent res_name values.

9 The 50 most frequent res_name values and their number of examples.

res_name Examples
SO4 38757
GOL 27615
EDO 21169
NAG 17941
CL 15627
CA 14377
ZN 13568
MG 9960
HEM 7397
PO4 7336
ACT 5430
DMS 4601
IOD 4367
PEG 3455
NAD 3278
K 3220
FAD 3100
MN 2824
CLA 2654
ADP 2574
MLY 2413
NAP 2327
CD 2264
UNX 2171
MPD 2165
PG4 2098
MAN 1955
FMT 1951
MES 1852
1PE 1543
ATP 1523
CU 1518
COA 1497
BR 1454
FMN 1439
EPE 1374
NDP 1312
PGE 1261
HEC 1236
NI 1167
TRS 1152
NO3 1144
ACY 1138
SF4 1132
FE 1085
SAH 1084
PLP 1067
GDP 1062
UNK 1032
C8E 1020
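A table like the one above, together with the restriction to the most frequent classes, could be obtained roughly as follows. This is a base R sketch under assumptions: the data frame `toy` and the names `top_n`, `counts`, `top_classes` and `toy_top` are hypothetical, and the report itself may have used dplyr instead.

```r
# Hypothetical toy labels standing in for the res_name column.
res_name <- c("SO4", "SO4", "SO4", "GOL", "GOL", "ZN", "CA")
toy <- data.frame(res_name = res_name, value = seq_along(res_name),
                  stringsAsFactors = FALSE)

top_n <- 2  # the report uses 50

# Count the examples per class and sort from most to least frequent.
counts <- sort(table(toy$res_name), decreasing = TRUE)
top_classes <- names(counts)[seq_len(min(top_n, length(counts)))]

# Keep only the rows that belong to the most frequent classes.
toy_top <- toy[toy$res_name %in% top_classes, ]

print(counts)
print(toy_top)
```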

10 Distribution plots:

10.1 Number of atoms.

10.2 Number of electrons.

11 Correlations between variables.

The table shows the correlations between the numeric columns of the dataset, computed with cor(). Only pairs whose correlation has an absolute value greater than 0.6 are displayed.
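The filtering of correlations by absolute value can be sketched as follows. The report does not show how its correlation table was built, so this is an assumption-laden base R illustration: the data frame `toy` is synthetic, and the column names `X1`, `X2` and `value` are chosen only to resemble the names used in the regression code later in the report.

```r
set.seed(123)
x <- rnorm(200)
toy <- data.frame(
  a = x,
  b = 2 * x + rnorm(200, sd = 0.1),  # strongly correlated with a
  c = rnorm(200)                     # independent noise
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(toy)

# Reshape the matrix into a long table of (X1, X2, value) rows.
cor_long <- as.data.frame(as.table(cor_matrix), stringsAsFactors = FALSE)
names(cor_long) <- c("X1", "X2", "value")

# Drop self-correlations and keep only the strong relationships.
strong <- cor_long[cor_long$X1 != cor_long$X2 & abs(cor_long$value) > 0.6, ]

print(strong)
```

Each strongly correlated pair appears twice in `strong`, once in each direction, because the correlation matrix is symmetric.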

11.1 Correlation table.

11.2 Heat map for selected columns.

12 The 10 classes with the largest discrepancy:

The discrepancy is computed as the total number of rows in which the compared values differ.
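A minimal sketch of this count, assuming that the compared values are a local_res_* column and its dict_* counterpart (the report does not state this explicitly). The data frame `toy` and the names `mismatch`, `per_class` and `top10` are hypothetical.

```r
toy <- data.frame(
  res_name = c("NAG", "NAG", "NAG", "SO4", "SO4", "MAN"),
  local_res_atom_non_h_count = c(14, 14, 15, 5, 5, 11),
  dict_atom_non_h_count      = c(15, 15, 15, 5, 5, 12),
  stringsAsFactors = FALSE
)

# Flag the rows where the local and dictionary atom counts disagree.
toy$mismatch <- toy$local_res_atom_non_h_count != toy$dict_atom_non_h_count

# Sum the flags per class and sort from the largest to the smallest count.
per_class <- aggregate(mismatch ~ res_name, data = toy, FUN = sum)
names(per_class)[2] <- "Discrepancy"
per_class <- per_class[order(-per_class$Discrepancy), ]

# Keep at most the ten classes with the largest discrepancy.
top10 <- head(per_class, 10)
print(top10)
```

The same pattern would apply to the electron sums by swapping in the two electron columns.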

12.1 Number of atoms.

res_name Discrepancy
NAG 17507
MLY 2395
MAN 1763
UNK 1032
PLP 933
CLA 847
1PE 711
C8E 564
PG4 489
NAP 214

12.2 Number of electrons.

res_name Discrepancy
NAG 17507
MLY 2395
MAN 1763
UNK 1032
PLP 933
CLA 847
1PE 711
C8E 564
PG4 489
NAP 214

13 Distribution of the part_01 column values.

14 Linear regression:

14.1 For the number of atoms.

set.seed(123)

# Columns correlated with local_res_atom_non_h_count.
# Note: `!=` against a vector is recycled element-wise; it is not a set-membership test like %in%.
reg_at <- cor_pdb %>%
  filter(X2 == "local_res_atom_non_h_count",
         X1 != c("dict_atom_non_h_count", "dict_atom_non_h_electron_sum"))
reg_at_names <- reg_at[, 1]

regresja_at <- pdb_clear_last %>%
  select(reg_at_names, local_res_atom_non_h_count)

# 70/30 train/test split of the full pdb_clear_last table.
idx_at <- createDataPartition(pdb_clear_last$local_res_atom_non_h_count,
                              p = 0.7, list = FALSE)
training_at <- pdb_clear_last[idx_at, ]
testing_at <- pdb_clear_last[-idx_at, ]

# 2-fold cross-validation repeated 5 times.
control <- trainControl(method = "repeatedcv", number = 2, repeats = 5)

fit_at <- train(local_res_atom_non_h_count ~ ., data = training_at,
                method = "glm", metric = "RMSE", trControl = control)
predAt <- predict(fit_at, newdata = testing_at)
postResample(predAt, testing_at$local_res_atom_non_h_count)
##       RMSE   Rsquared        MAE 
## 0.13184190 0.99990067 0.02184218

The RMSE is close to 0. The R^2 measure is also satisfactory, with a value close to 1.

14.2 For the number of electrons.

set.seed(123)

# Columns correlated with local_res_atom_non_h_electron_sum.
# Note: `!=` against a vector is recycled element-wise; it is not a set-membership test like %in%.
reg_el <- cor_pdb %>%
  filter(X2 == "local_res_atom_non_h_electron_sum",
         X1 != c("dict_atom_non_h_count", "dict_atom_non_h_electron_sum",
                 "local_res_atom_non_h_electron_sum"))
# Note: the names below are taken from reg_at (section 14.1), so reg_el is not used further.
reg_el_names <- reg_at[, 1]

regresja_el <- pdb_clear_last %>%
  select(reg_el_names, local_res_atom_non_h_count)

idx_el <- createDataPartition(pdb_clear_last$local_res_atom_non_h_electron_sum,
                              p = 0.7, list = FALSE)
training_el <- pdb_clear_last[idx_el, ]
testing_el <- pdb_clear_last[-idx_el, ]

# 2-fold cross-validation repeated 5 times.
control <- trainControl(method = "repeatedcv", number = 2, repeats = 5)

# Note: the model is fitted on regresja_el (all rows), not on training_el.
fit_el <- train(local_res_atom_non_h_electron_sum ~ ., data = regresja_el,
                method = "glm", metric = "RMSE", trControl = control)
predEl <- predict(fit_el, newdata = testing_el)
postResample(predEl, testing_el$local_res_atom_non_h_electron_sum)
##       RMSE   Rsquared        MAE 
## 12.9018846  0.9791096  8.8296217

The RMSE is unsatisfactory; a good result could not be obtained for the number of electrons.
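To make the three numbers reported by postResample() easier to read, the sketch below computes them by hand on a tiny example. The vectors `obs` and `pred` are made up for illustration. The R-squared here is the squared Pearson correlation between predictions and observations, which, to the best of my knowledge, is how caret computes it for regression.

```r
# Hypothetical observed and predicted values.
obs  <- c(3, 5, 7, 9)
pred <- c(2.5, 5.5, 7, 10)

# Root mean squared error: penalizes large errors more strongly.
rmse <- sqrt(mean((pred - obs)^2))

# Mean absolute error: the average size of an error.
mae <- mean(abs(pred - obs))

# Squared Pearson correlation between predictions and observations.
rsq <- cor(pred, obs)^2

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

RMSE and MAE are expressed in the units of the target variable, so their values for targets on different scales (atom counts versus electron sums) are not directly comparable.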

15 Random Forest classifier for res_name values.

15.1 For ntree = 5.

Because processing the data took a very long time, the dataset was reduced and the ntree parameter was limited to 5. Increasing ntree was attempted in order to obtain better results, but already at the default value of 500, with only 1000 rows of the dataset, the running time exceeded one hour.

# Remove the columns that must not be used as predictors.
pdb_clear_50_rf <- pdb_clear_50 %>%
  select(-dict_atom_non_h_electron_sum, -dict_atom_non_h_count,
         -part_00_density_segments_count, -part_01_density_segments_count,
         -part_02_density_segments_count,
         -local_res_atom_non_h_electron_sum, -local_res_atom_non_h_count)

# Keep about 10% of the rows to shorten the training time.
ogranicz <- createDataPartition(pdb_clear_50_rf$res_name, p = 0.9, list = FALSE)
pdb_clear_50_rf <- pdb_clear_50_rf[-ogranicz, ]

# 70/30 train/test split.
idx_kl <- createDataPartition(pdb_clear_50_rf$res_name, p = 0.7, list = FALSE)
training_kl <- pdb_clear_50_rf[idx_kl, ]
testing_kl <- pdb_clear_50_rf[-idx_kl, ]

# 2-fold cross-validation repeated 5 times.
control <- trainControl(method = "repeatedcv", number = 2, repeats = 5)

set.seed(123)

fit_kl <- train(as.factor(res_name) ~ .,
                data = training_kl,
                method = "rf",
                trControl = control,
                ntree = 5)

rfRes <- predict(fit_kl, newdata = testing_kl)

# res_name is expected to be the first column of testing_kl.
cm_1 <- confusionMatrix(data = rfRes, factor(testing_kl[, 1]))

cm_1$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.3403444      0.2878040      0.3296938      0.3511175      0.1527540 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

The result is not satisfactory. It stems from the limited number of samples and the very small value of the ntree parameter.
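For readers unfamiliar with the fields of cm_1$overall, the sketch below derives Accuracy, Kappa and the no-information rate (which, as far as I know, is what caret reports as AccuracyNull) from a small confusion matrix in base R. The label vectors `truth` and `pred` are hypothetical, and the variable names are illustrative only.

```r
# Hypothetical true and predicted class labels.
truth <- factor(c("SO4", "SO4", "GOL", "GOL", "ZN", "ZN"))
pred  <- factor(c("SO4", "GOL", "GOL", "GOL", "ZN", "SO4"),
                levels = levels(truth))

# Confusion matrix: rows are predictions, columns are reference labels.
cm <- table(Prediction = pred, Reference = truth)
n <- sum(cm)

# Accuracy: the share of predictions on the diagonal.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement.
cohen_kappa <- (accuracy - expected) / (1 - expected)

# No-information rate: accuracy of always predicting the largest class.
no_info_rate <- max(colSums(cm)) / n

print(c(Accuracy = accuracy, Kappa = cohen_kappa, NoInfoRate = no_info_rate))
```

Comparing the accuracy with the no-information rate shows how much a classifier improves over always predicting the most frequent class.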